The purpose of this tutorial is to introduce researchers at the Laboratory of Genomic Epidemiology of Malaria at Harvard university in the analysis of genotyping information generated through amplicon sequencing (also called Microhaplotype genotyping). This tutorial will cover:

  1. Importing and handling tables in CIGAR format in R environment.

  2. Adding metadata.

  3. Performance of the genotyping process.

  4. Molecular surveillance of drug resistance.

  5. Monitoring transmission intensity.

  6. Measuring geographic connectivity.

In this tutorial we will use a data set of 1300 samples distributed across 8 sequencing runs as a example. These samples comes from a collaboration with the group of Caucaseco, which has collected samples of P. falciparum from 2020 to 2022 in 5 municipalities located in Colombia. The data set has geographic information, however this information is encoded to protect the origin of the participants.

1 Importing and handling tables in CIGAR format in R environment

Our first step will be to call all required packages and functions in the R environment:

source('functions_and_libraries/amplseq_required_libraries.R')
source('functions_and_libraries/amplseq_functions.R')
sourceCpp('functions_and_libraries/hmmloglikelihood.cpp')

This tutorial start with .tsv table in CIGAR format that has been generated by the malaria-amplicon-pipeline of our lab. The output of this pipeline has the following structure:

\(\begin{array}{c|c:c:c:c:c:c} \text{sampleID}&Gene_1,Allele_1&Gene_1,Allele_2&...&Gene_1,Allele_k&... & Gene_m, Allele_{k_m}\\ \hline ID_1 & \text{Read counts} &&& \\ \hdashline ... &&&&\\ \hdashline ID_n &&&& \end{array}\)

Where the \(Allele\) is coded in CIGAR format, typing “.” for the reference allele, \([0-9]*[A,T,C,G]\) for each point mutation observed, and \([0-9]*[D,I]=[A,T,C,G]*\) for indels (Insertion and Deletions). The numbers denotes the position in the amplicon where the polymorphism is located

These CIGAR tables are stored in the server of the lab using the following path structure:

"STUDYNAME/RUNNAME/dada2/run_dada2/CIGARVariants.out.tsv"

As this structure is maintained across all studies in the lab, we have created a function called read_cigar_tables that only requires two arguments: paths (name of the folder of the study or the STUDYNAME), and sample_id_pattern (a pattern string in the IDs of the samples that differentiate them from controls). For this particular example, the path will be "data/sequencing_data/" , while the pattern will be "ID". In case all ".tvs" files are stored in one folder, you can use the argument files = list_of_files instead of the argument paths, where list_of_files is a vector with the names of all files we want to read.

cigar_object = read_cigar_tables(paths = "data/sequencing_data/", sample_id_pattern = "ID")

This function generates a S4 cigar object containing two data.frames, 1) a cigar_table, where genotyping information is stored in cigar format; and 2) a metadata table with 3 variables (columns), Sample_id which contains samples Id’s, run which contains the run number (useful for comparing performance across batches of samples), and typeofSamp which differentiates between controls and samples of interest. In order to view these tables use View(cigar_table$cigar_table) or View(cigar_table$metadata) in the console. If there are duplicated samples, only the replicated with the highest read depth across all loci will be keept.

2 Adding Metadata

Metadata would come from different sources, sometimes the sample ID incorporates metadata in their coding system, other times metadata can be stored in a different table, having one column specifying the sample ID. In the case of this data set, we will perform the analysis at the level of Municipality and in intervals of three months starting from the first day of collection of samples. Then, to add sampling location, month of collection and other metadata we are going to merge (using the function left_join()) our metadata table in the cigar_object to a metadata table from an external source that contain all the metadata associated with these samples. Thus, we will upload a metadata table from an external source. For this purpose we are going to import the file called metadata.csv into our R environment.

# Upload metadata from an external source
metadata = read.csv('data/metadata.csv')

# Merge the external metadata with our cigar_object
cigar_object@metadata = left_join(cigar_object@metadata,
                              metadata,
                              by = 'Sample_id')

Moreover, there are some artifacts that dada2 pipeline doesn’t resolve well like the generation of chimeras (artificial fusion of two different haplotypes during the PCR amplification) in polyclonal samples. That is the case of the alleles PF3D7_1302900,1G and PF3D7_0612900,215A that are observed in polyclonal infections only. So we are going to proceed to remove both alleles from our cigar_table. Additionally, PvDHFR locus was amplified in order to detect mixed infections with Plasmodium vivax, so we are going to remove this locus too.

# removing allele PF3D7_1302900,1G----
# this polymorphism is always present as a minor allele in polygenomic samples, increasing the proportion of polygenomic infections to up to 27%

cigar_object@cigar_table = cigar_object@cigar_table[,!grepl("PF3D7_1302900.1G", colnames(cigar_object@cigar_table))]
cigar_object@cigar_table = cigar_object@cigar_table[,!grepl("PF3D7_0612900.215A", colnames(cigar_object@cigar_table))]

# removing locus PvDHFR----
# this loci belongs to P. vivax

cigar_object@cigar_table = cigar_object@cigar_table[,!grepl("PvDHFR", colnames(cigar_object@cigar_table))]

We can use ggplot2 to plot the sample size (number of genotyped samples) by Municipality over time.

# Define the temporal unit of analysis
dates = sort(unique(cigar_object@metadata$Quarter_of_Collection))

# Define the geographic unite of analysis
pop_levels = levels(as.factor(cigar_object@metadata$Subnational_level2))
pop_colors = c("firebrick3", "dodgerblue3", "gold3", "darkseagreen3", "lightsalmon2")

plot_temporal_collection_of_samples = cigar_object@metadata %>%
  filter(!is.na(Subnational_level2)) %>% # Remove samples with no geographic information (Controls)
  summarise(nsamples= n(), .by = c(Subnational_level2, Quarter_of_Collection))%>%
  ggplot(aes(x = Quarter_of_Collection, y = nsamples, fill = factor(Subnational_level2,
                                                              levels = pop_levels)))+
  geom_col()+
  theme_bw()+
  scale_fill_manual(values = pop_colors)+
  facet_wrap(.~factor(Subnational_level2,
                      levels = pop_levels),
             strip.position = "top", ncol = 5)+
  labs(title = 'Number of Samples collected over time',
       y = "Collected samples by PCD",
       x = "Quarter")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
        legend.position =  "none")+
  scale_x_discrete(limits = dates)

plot_temporal_collection_of_samples

3 Performance of the genotyping process

A first step in analyzing the genotypic data is to measure the performance of the genotyping process in terms of the proportion of samples amplified by each locus (amplification rate by locus) and the proportion of loci amplified per each sample (amplification rate per sample). In order to perform that task, we are going to convert our cigar table to a more manageable format called ampseq format, where amplicons or genes are going to be in columns, samples in rows, and the allele in cigar format is going to be in each cell of the matrix together with the read count of that allele in the sample.

\(\begin{array}{c|c:c:c:c} \text{sampleID}&Gene_1&Gene_2&...& Gene_m\\ \hline ID_1 & Allele_{i_{i \in Alleles_{Gene_1}}}\text{:Read count} &&& \\ \hdashline ... &&&&\\ \hdashline ID_n &&&& \end{array}\)

For that purpose we are going to use the function cigar2ampseq. This function requires 4 arguments. First a cigar_object (containing the cigar_table and the metadata). Then markers, which is a table with the set of markers and their names (in a column named amplicon). This argument is optional, and in case is NULL, the name of the amplicons will be extracted from the columns in the cigar_object, but information regarding chormosome location and length of the amplicon will be missing. The next arguments are min_abd and min_ratio, which are the minimum abundance (minimum read count) required to call an allele and the minimum ratio between the minor and major alleles in a polyclonal sample. By default these two arguments are 10 and 0.1 respectively. The last argument is remove_controls, by default is FALSE, but if change ti TRUE the information of the controls will be stored in a separated slot. Thus, this function automatically differentiate between controls and samples of interest, and its output is a ampseq_object S4 list with 6 slots: gt where the genotype information of the samples of interest is stored in a ampseq format; the metadata of the samples of interest; controls with all the information of the controls; markers containing the information of the markers such as the chromosome location; and other empty slots required for the next steps such as loci_performance and pop_summary.

# (alleles and abundance will be in cells)
markers = read.csv("reference/Pfal_3D7/markers.csv")
ampseq = cigar2ampseq(cigar_object, markers = markers, min_abd = 10, min_ratio = .1, remove_controls = T)

4 Read depth coverage

Once the data has been converted to the ampseq format, we can explore the read depth coverage by samples and by marker, and make comparisons between treatments, geographic regions or sequencing runs. For that we can use the function plot_coverage() which requires two arguments, the ampseq object and the variable from metadata that we want to use to split the information.

coverage_by_municipality = plot_coverage(ampseq, variable = "Subnational_level2")

The output of this function is a list with three plots

  1. A heatmap of the read depth (color intensity) of every marker (x-axis) and every sample (y-axis) by category of the chosen variable (panels)
coverage_by_municipality$plot_read_depth_heatmap

  1. A Jitter-Violin plot of the read depth (y-axis in log10 scale) of every marker and every sample (dots) by category of the chosen variable (x-axis)
coverage_by_municipality$plot_read_depth_violin 

  1. A Jitter-Violin plot of the total read depth (y-axis in log10 scale) of every sample (dots) by category of the chosen variable (x-axis)
coverage_by_municipality$plot_read_depth_violin_by_sample 

4.1 Amplification rate by locus

We can also use the function loci_amplification_rate to measure the proportion of samples that have been amplified by each amplicon marker. This function requires two arguments, the ampseq_object and threshold, which defines the minimum proportion of samples that a marker should amplify in order to be keep it for further analysis (default 0.65). All markers below this threshold will be removed and stored in a different slot in our object.

ampseq_filtered = locus_amplification_rate(ampseq, threshold = .65)

Thus in this data set 128 loci had an amplification rate above 0.65, and 15 loci were discarded. All discarded loci are stored in the slot discarded_loci within the ampseq_object. The discarded loci were: .

# Plot Amplification rate per loci ----

ggdraw()+
  draw_plot(ampseq_filtered@plots$amplification_rate_per_locus,
            x = 0,
            y = 0,
            width = 1,
            height = .5)+
  draw_plot(ampseq_filtered@plots$all_loci_amplification_rate,
            x = 0,
            y = .5,
            width = 1,
            height = .5)
**Figure 2:** Proportion of amplified samples per locus. Top figure shows the distribution of the amplification rate (or proportion of amplified samples) of loci, with the number of loci in the y-axis and the amplification rate in the x-axis. Bottom figure shows the the chromosome location of each locus (chromosomes in y-axis and position in the x-axis) and the gradient color represents its amplification rate.

Figure 2: Proportion of amplified samples per locus. Top figure shows the distribution of the amplification rate (or proportion of amplified samples) of loci, with the number of loci in the y-axis and the amplification rate in the x-axis. Bottom figure shows the the chromosome location of each locus (chromosomes in y-axis and position in the x-axis) and the gradient color represents its amplification rate.

4.2 Amplification rate by sample

The next step is to measure the proportion of loci amplified per each sample, also called the amplification rate of the samples. For that purpose we are going to use the function sample_amplification_rate that also requires only two arguments, the ampseq_object and a threshold, which is the minmum proportion of loci that a sample should amplified (by default 0.8).

ampseq_filtered = sample_amplification_rate(ampseq_filtered)

ampseq_filtered@plots$samples_amplification_rate+
         theme(axis.text = element_text(size = 12),
               axis.title = element_text(size = 12))
**Figure 3:** Proportion of amplified loci per sample. The figure shows the distribution of the amplification rate (or proportion of amplified loci) of the samples, with the number of samples in the y-axis and the amplification rate in the x-axis.

Figure 3: Proportion of amplified loci per sample. The figure shows the distribution of the amplification rate (or proportion of amplified loci) of the samples, with the number of samples in the y-axis and the amplification rate in the x-axis.

5 Molecular surveillance of drug resistance

Surveillance of polymorphisms or allelic variants associated with resistance against antimalarials is of the utmost importance in public health. That is why our panel of markers includes 5 genes (9 markers in total, some genes are genotyped by more than one marker) that have polymorphisms associated with antimalarial resistance. In this section we will identify the different haplotypes of these genes present in our data set, and we will determine the frequency of each haplotype in each study area and in each quarter of the year. For this we will use the function drug_resistant_haplotypes which has 7 arguments. The first argument is the ampseq_object. The second is a table with the reference alleles for each gene, and the polymorphisms associated with resistance. The structure of this table is as follows:

\[\begin{array}{c|c:c:c} \text{Chromosome} & \text{Gene_ID} & \text{Description} & \text{Mutation} & \text{Anotation} \\ \text{Pf3D7_04_v3} & \text{PF3D7_0417200} & dhfr & \text{N51I} & \text{Pyrimethamine resistance} \\ ... \end{array}\]

Where the mutation describes the reference amino acid (sensitive to the drug), the position in the amino acid chain, and finally the amino acid associated with resistance.

The third and fourth arguments refer to the common name of the gene (or the name by which it is identified in the markers table in the ampseq_object) and the gene ID in the gff file of the referential strain (in this case strain 3D7). The fifth and sixth arguments are the .gff and .fasta files of the genome of the reference strain, and the last argument, variables, is a vector that has the columns of the metadata table that will be used to define the subpopulations and generate the report graph. For this particular example we will define as variables = c('Sample_id', 'Subnational_level2', 'Quarter_of_Collection'), where Sample_id, is the column that we will use to map the data, Subnational_level2 is the column to define the area of study, and Quarter_of_Collection defines the time scale.

The function generates a list with 3 tables and a graph. The first table contains the haplotypes found in each sample (rows) for each gene (columns). Each number in the haplotype indicates a position in the amino acid chain, the letter before the number indicates the reference allele (with sensitive phenotype), and the letter after the number is the allele observed in the sample. If the letters are in capital letters, that allele in that position has been previously reported as resistant, whereas the letters in lower case indicate that it is the first time that position has been reported as polymorphic. In case the sample is polyclonal, each allele observed is separated by the following symbol |, and the order in which they are reported is based on their read count. Null data, defined as sections of the gene that have not been amplified, are completed with the question mark ?.

The second table contains the inferred phenotype for each sample (rows) for each gene (columns). The third table contains the count of each haplotype in the population (number of times observed) and its frequency in the population (defined based on geographic and temporal information). It is important to note that polyclonal samples are included in this calculation, so the denominator is defined as the total number of haplotypes found in the populations. Remember that if we run this analysis after filtering not amplified samples and loci, some polymorphism will not be included. To report all mutations included in our panel of markers associated with drug resistance, we have to recover that information from the slot discarded_loci.

drug_resistant_haplotypes_plot = drug_resistant_haplotypes(ampseq_filtered,
                                                         reference_alleles = 'reference/Pfal_3D7/drugR_alleles.csv',
                                        gene_names = c('PfDHFR',
                                                       'PfMDR1',
                                                       'PfDHPS',
                                                       'PfKelch13',
                                                       'PF3D7_1447900'),
                                        gene_ids = c('PF3D7_0417200',
                                                     'PF3D7_0523000',
                                                     'PF3D7_0810800',
                                                     'PF3D7_1343700',
                                                     'PF3D7_1447900'),
                                        gff_file = "reference/Pfal_3D7/PlasmoDB-59_Pfalciparum3D7.gff",
                                        fasta_file = "reference/Pfal_3D7/PlasmoDB-59_Pfalciparum3D7_Genome.fasta",
                                        variables = c('Sample_id', 'Subnational_level2', 'Quarter_of_Collection'),
                                        na.var.rm = TRUE,
                                        filters = c('Subnational_level2;Municipality 1,Municipality 2,Municipality 3,Municipality 4,Municipality 5',                               'Quarter_of_Collection;2020-Q4,2021-Q1,2021-Q2,2021-Q3,2021-Q4,2022-Q1,2022-Q2,2022-Q3'))

The colors assigned to the haplotypes in the bar blot are randomly chosen, see below:

drug_resistant_haplotypes_plot$haplo_freq_plot

However, we can manually assign colors based on the resistance profile of the haplotye:

blues = brewer.pal(9, 'Blues')
reds = brewer.pal(9, 'Reds')

drug_resistant_haplotypes_plot$haplo_freq_plot+
  theme(axis.text.x = element_text(angle = 90))+
  scale_fill_manual(values = c(
    'gray75',blues[6], reds[3], # MDR2
    reds[c(8,7)], blues[3], reds[6], blues[6], reds[c(9,3)], # DFHR
    'gray75',blues[4], reds[5], blues[c(2,6)], reds[c(4,9,6,8)], # DHPS
    blues[6], #K13
    'gray75', reds[c(5,4)], blues[2], reds[6] # MDR1
  ))+
  labs(y = 'Haplotype prevalence in infected individuals')
**Figure 4:** Frequency of haplotypes of genes associated with drug resistance. y-axis shows the frequency in the sub-population, x-axis shows the quarter of the year where the sample was collected, rows panels corresponds to the five study areas, and column panels represents each genotyped gene. Colors were assigned based on the possible phenotype, dark blue indicates sensitive, while dark red indicates resistance. Light colors where assigned for partial haplotype information, and gray for complete missging data

Figure 4: Frequency of haplotypes of genes associated with drug resistance. y-axis shows the frequency in the sub-population, x-axis shows the quarter of the year where the sample was collected, rows panels corresponds to the five study areas, and column panels represents each genotyped gene. Colors were assigned based on the possible phenotype, dark blue indicates sensitive, while dark red indicates resistance. Light colors where assigned for partial haplotype information, and gray for complete missging data

6 Monitoring transmission intensity

The transmission of Malaria has an effect on the evolutionary processes that affect the parasite population. Thus, various genetic indicators are used to identify changes in the intensity of transmission at a spatial and temporal level. One of these indicators is the prevalence of polyclonal infections, or the coexistence of two or more different clones (haplotypes) within the infected individual. As the human parasite genome is haploid, the presence of a single haplotype is expected. The presence of two or more different haplotypes in the sample may be due to the transmission of two or more different genotypes from the same bite (co-infection), or the acquisition of parasites by multiple infections or independent bites (super-infection); and both processes are related to the intensity of transmission. In low transmission there are few infective bites, so there is little chance that a person can acquire a super-infection or co-infection, and therefore most infections are monoclonal. On the other hand, when the necessary conditions exist for the increase in the mosquito population and the interaction between it and the human is favored, super-infections and co-infections increase, generating an increase in the prevalence of polyclonal infections.

The genetic relatedness between parasites is also affected by the intensity of transmission. As we mentioned previously, in low transmission scenarios monoclonal infections are favored, so when an infected person is bitten, a single haplotype is transmitted and self-fertilization of the parasite occurs in the mosquito (sexual reproduction of two genetically identical or related haplotypes). This mechanism of propagation by inbreeding is called clonal propagation, and it causes all the alleles of a haplotype to segregate together and not independently, thus increasing the genetic relatedness in the population.

The parasite effective population size (number of individuals in the population that explains the magnitude of the effect of genetic drift in the population) and its diversity, are also influenced by transmission. The increase in transmission is a consequence of the reproductive success of the different lineages of circulating parasites and the increase in their effective population size. These circumstances increase the introduction of new variants in the population (increase in the population mutation rate or \(\theta\)), the number of polymorphic sites, the richness of alleles of each polymorphic site (number of alleles per locus), and genetic diversity in terms of heterozygosity of the population (probability of selecting two different alleles by chance). On the other hand, when transmission decreases and the effective population size is reduced, and genetic drift causes the random fixation of alleles, and the diversity of the population is reduced. However, the effective size of the population and its diversity are also strongly influenced by the introduction of new variants as a result of migration and the genetic structure of the parasite population, so the relationship between these metrics and transmission is not linear.

6.1 Complexity of infection and transmission

In this section we will focus on describing the relationship between the intensity of transmission, inferred through the number of samples collected, and the proportion of polyclonal infections. To do this, we will first use the function get_polygenomic to identify polyclonal infections, determine their complexity of infection (number of different clones coexisting in the sample), and the number of heterozygous loci per sample. This information is automatically added to the metadata table of the ampseq_object. We will then proceed to describe the distribution of heterozygous loci per sample, the COI, and the proportion of polyclonal infections in each population. Finally, we will describe the temporal change in the proportion of polyclonal infections and its association with the number of samples collected.

ampseq_filtered = get_polygenomic(ampseq_object = ampseq_filtered, 
                                  strata = "Subnational_level2",
                                  na.rm = FALSE,
                                  filters = NULL)

The function get_polygenomic generates 3 tables:

  1. A vector that describes for each loci the proportion of polyclonal samples detected. This vector is added to the loci_performance table in our ampseq_object if we set update_popsummary = T. This information is useful in order to identify genotyping errors that are not filtered out in the denoising pipeline and the identification of chimeras by dada2. That is the case of the alleles PF3D7_1302900,1G and PF3D7_0612900,215A for instance.
ampseq_filtered@loci_performance %>% 
  select(loci, prop_poly_detected)%>%datatable(rownames = F)%>%
  formatRound(columns='prop_poly_detected', digits=3)
  1. A table that describes for each sample:
  • The number of heterozygous loci (NPolyLoci) or heterozygous amplicons in the sample.
  • The name of the heterozygous loci (Polyloci).
  • The observed alleles in each heterozygous loci (alleles_at_loci) where alleles within the same loci are separated by “_“, while alleles from different loci are separated by”/“.
  • The number of alleles observed in each heterozygous loci (nalleles_per_loci) where values from different loci are separated by “/”.
  • And finally the complexity of the infection (coi) defined as the maximum number of alleles observed in any of the heterozygous loci in the sample. As this definition is permissible as samples the only have one heterozygous loci can be considered polyclonal. The user can easily change that definition to a more stringent incorporating the information of the minimum number of heterozygous loci required to be considered polyclonal. In future updates we will incorporate the Within host divergence (\(F_{ws}\)) index.

This table is added to the metadata table in our ampseq_object if we set update_popsummary = T.

ampseq_filtered@metadata %>%
  select(Sample_id,NPolyLoci, Polyloci, alleles_at_loci, nalleles_per_loci, coi)%>%
  datatable(rownames = F)
  1. A pop summary table that summarize the mean COI in each population, the total number of polyclonal infections, and the proportion polyclonal infections with its confident intervals at 95%.
ampseq_filtered@pop_summary%>%
  datatable(rownames = F) %>%
  formatRound(columns=c('mean_coi', 'prop_poly', 'prop_poly_lower', 'prop_poly_upper'),
              digits=3)

We can visually inspect all this information by using histograms. Here is an example with the number of heterozygous loci per sample by sampling location. We will use the function log_scale_histogram because the number of samples with only one heterozygous loci is high with respect to the number of samples with more than 10 heterozygous loci.

plot_log_NPolyLoci_by_pop = log_scale_histogram(data = ampseq_filtered@metadata[ampseq_filtered@metadata$NPolyLoci != 0,],
                                                var = "NPolyLoci", binwidth = 1, group_by = "Subnational_level2",
                                                levels = pop_levels, x_label = "Number of heterozygous loci per sample",
                                                fill_color = pop_colors,
                                                y_breaks = c(1, 5,10, 30), ncol = 4)+
  labs(title = 'Distribution of # heterozygous loci per sample', 
       y = "# Samples",
       x = "# heterozygous loci per sample")

plot_log_NPolyLoci_by_pop
**Figure 5:** Distribution of the number of heterozygous loci per samples per population.

Figure 5: Distribution of the number of heterozygous loci per samples per population.

We can do the same with the COI by sample.

plot_log_coi_by_pop = log_scale_histogram(data = ampseq_filtered@metadata,
                                          var = "coi",
                                          binwidth = 1,
                                          group_by = "Subnational_level2",
                                          levels = pop_levels,
                                          x_label = "COI",
                                          fill_color = pop_colors,
                                          y_breaks = c(1,2, 5,10, 20, 50, 100,200, 400), ncol = 5)+
  labs(title = 'Distribution of COI by sampling location',
       y = "# Samples")

plot_log_coi_by_pop
**Figure 6:** Distribution of the complexity of infection (COI) by study area. y-axis is in logarithm scale.

Figure 6: Distribution of the complexity of infection (COI) by study area. y-axis is in logarithm scale.

In the case of the proportion of polyclonal infections by study area, we can use a a barplot.

plot_poly_by_pop = ampseq_filtered@pop_summary %>% ggplot(aes(x = factor(pop, levels = c(pop_levels, "Total")),
                                                          y = prop_poly,
                                                          fill = factor(pop, levels = c(pop_levels, "Total"))))+
  geom_col(alpha = .6) +
  geom_errorbar(aes(ymin = prop_poly_lower, ymax = prop_poly_upper), width = .2)+
  theme_bw() +
  labs(title = "Frequency of polyclonal infections",
       y = "Frecquency") +
  scale_fill_manual(values = c(pop_colors, "gray30"))+
  theme(axis.text = element_text(size = 12),
        axis.title = element_blank(),
        legend.position = "none")

plot_poly_by_pop
**Figure 7:** Proportion of polyclonal infections per study area.

Figure 7: Proportion of polyclonal infections per study area.

We can also explore temporal trends in the proportion of polyclonal infections. First we are going to create a new variable in our metadata concatenating the information of the study area (Subnational_level2) and the information of the Quarter_of_Collection.

ampseq_filtered@metadata %<>% mutate(Pop_quarter = paste(Subnational_level2, Quarter_of_Collection, sep = '_'))

Then we will use again the function get_polygenomic, using our new variable in the slot of strata and the slot filters equals to 'Municipality 1|Municipality 2|Municipality 3', in this way we will just calculate the proportion of polyclonal infections for the populations that have samples for most time points. As we don’t want to overwrite our previous analysis, we will set update_popsummary = F and store the output in a new object.

poly_by_pop_over_time = get_polygenomic(ampseq_object = ampseq_filtered,
                                        strata = "Pop_quarter",
                                        update_popsummary = F,
                                        na.rm = TRUE,
                                        filters = 'Municipality 1|Municipality 2|Municipality 3')

poly_by_pop_over_time
##                       pop    n mean_coi n_poly  prop_poly prop_poly_lower
## 1                   Total 1059 1.115203    118 0.11142587    0.0931027507
## 2  Municipality 1_2020-Q3    2 1.000000      0 0.00000000    0.0000000000
## 3  Municipality 1_2020-Q4   11 1.000000      0 0.00000000    0.0000000000
## 4  Municipality 1_2021-Q1   13 1.000000      0 0.00000000    0.0000000000
## 5  Municipality 1_2021-Q2   22 1.045455      1 0.04545455    0.0011501475
## 6  Municipality 1_2021-Q3   56 1.017857      1 0.01785714    0.0004520015
## 7  Municipality 1_2021-Q4   99 1.090909      9 0.09090909    0.0424164701
## 8  Municipality 1_2022-Q1  102 1.068627      7 0.06862745    0.0280352646
## 9  Municipality 1_2022-Q2  119 1.075630      8 0.06722689    0.0294685350
## 10 Municipality 1_2022-Q3   49 1.183673      9 0.18367347    0.0875903353
## 11 Municipality 2_2020-Q4   29 1.103448      3 0.10344828    0.0218637368
## 12 Municipality 2_2021-Q1   30 1.033333      1 0.03333333    0.0008435709
## 13 Municipality 2_2021-Q2   11 1.000000      0 0.00000000    0.0000000000
## 14 Municipality 2_2021-Q3   38 1.026316      1 0.02631579    0.0006660362
## 15 Municipality 2_2021-Q4   16 1.062500      1 0.06250000    0.0015811117
## 16 Municipality 2_2022-Q1   13 1.000000      0 0.00000000    0.0000000000
## 17 Municipality 2_2022-Q2   23 1.043478      1 0.04347826    0.0011001686
## 18 Municipality 2_2022-Q3   34 1.176471      5 0.14705882    0.0495284553
## 19 Municipality 3_2020-Q3   13 1.230769      3 0.23076923    0.0503810735
## 20 Municipality 3_2020-Q4   51 1.196078     10 0.19607843    0.0982434165
## 21 Municipality 3_2021-Q1   61 1.278689     17 0.27868852    0.1714720843
## 22 Municipality 3_2021-Q2   25 1.120000      3 0.12000000    0.0254653966
## 23 Municipality 3_2021-Q3   27 1.222222      5 0.18518519    0.0630000065
## 24 Municipality 3_2021-Q4   19 1.210526      4 0.21052632    0.0605245377
## 25 Municipality 3_2022-Q1   29 1.068966      1 0.03448276    0.0008726469
## 26 Municipality 3_2022-Q2  110 1.136364     15 0.13636364    0.0783762502
## 27 Municipality 3_2022-Q3   57 1.228070     13 0.22807018    0.1273971076
##    prop_poly_upper
## 1        0.1319392
## 2        0.8418861
## 3        0.2849142
## 4        0.2470526
## 5        0.2284444
## 6        0.0955259
## 7        0.1655690
## 8        0.1362970
## 9        0.1281695
## 10       0.3202212
## 11       0.2735152
## 12       0.1721695
## 13       0.2849142
## 14       0.1380990
## 15       0.3023207
## 16       0.2470526
## 17       0.2194866
## 18       0.3105657
## 19       0.5381315
## 20       0.3311574
## 21       0.4082875
## 22       0.3121903
## 23       0.3808299
## 24       0.4556531
## 25       0.1776443
## 26       0.2149201
## 27       0.3583634

Now we can generate a bar plot using ggplot

plot_poly_by_pop_over_time = poly_by_pop_over_time %>%
  filter(pop != 'Total')%>%
  mutate(
  Population = stringr::str_split(pop, '_', simplify = TRUE)[,1],
  Date = stringr::str_split(pop, '_', simplify = TRUE)[,2],
  prop_poly_lower = case_when(
      prop_poly == 0 ~ 0,
      prop_poly != 0 ~ prop_poly_lower),
      prop_poly_upper = case_when(
        prop_poly == 0 ~ 0,
        prop_poly != 0 ~ prop_poly_upper)
  )%>%
  ggplot(aes(x = Date,
             y = prop_poly,
             ymin = prop_poly_lower,
             ymax = prop_poly_upper,
             fill = factor(Population, levels = pop_levels)))+
  geom_col()+
  geom_errorbar(width = .2)+
  facet_wrap(~factor(Population, levels = pop_levels), ncol = 5)+
  theme_bw()+
  scale_fill_manual(values = pop_colors)+
  labs(title = 'Temporal change of the proportion of polyclonal infections',
       y = "Polyclonal infections",
       x = "Date of Collection")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
        legend.position =  "none")

plot_poly_by_pop_over_time
**Figure 8:** Proportion of polyclonal infections per study area over year quarters.

Figure 8: Proportion of polyclonal infections per study area over year quarters.

Finally, we can test if there is correlation between the proportion of polyclonal infections and the number of collected samples per temporal unit (as an approximation of the malaria burden).

For that we will joint our table poly_by_pop_over_time with a summary of our metadata table.

nsamples_vs_proppoly = 
  left_join(poly_by_pop_over_time%>% 
              select(pop, prop_poly), metadata %>%
              mutate(Quarter_of_Collection = 
                       paste(substr(Date_of_Collection, 1, 4), quarters.Date(Date_of_Collection), sep='-')) %>%
              summarise(nsamples= n(), .by = c(Subnational_level2, Quarter_of_Collection)) %>% 
              mutate(pop = paste(Subnational_level2, Quarter_of_Collection, sep = '_')), by = 'pop') %>%
  filter(Subnational_level2 %in% c('Municipality 1', 'Municipality 2', 'Municipality 3'))

Then we can plot the correlation using the function ggscatter.

plot_cor_proppoly_nsamples = ggscatter(nsamples_vs_proppoly, x = "prop_poly", y = "nsamples", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "spearman",
          title = '# Samples vs. Prop. Poly',
          xlab = "Proportion of polyclonal samples", ylab = "Number of samples", facet.by = 'Subnational_level2')

plot_cor_proppoly_nsamples
**Figure 9:** Spearman correlation of the number of collected samples and proportion of polyclonal infections per year quarters. Each point represent the measurement of each year quarter in each study area.

Figure 9: Spearman correlation of the number of collected samples and proportion of polyclonal infections per year quarters. Each point represent the measurement of each year quarter in each study area.

6.2 Relatedness and Transmission

Relatedness is defined as the probability that, at any locus in the genome, the alleles sampled from two different individuals are identical by descent (IBD). A Hidden Markov Model based method was developed for the estimation of relatedness (HMMIBD), where the relatedness or \(r\) is estimated as a function of the haplotype of the two sampled parasites (\(Y_i\) and \(Y_j\)), the frequency of the alleles in the population, the physical distance between loci, the recombination rate (\(\rho\)), a switching rate of the Markov chain (\(k\)), and a constant genotyping error rate (\(\varepsilon\)).

In this section we will measure the pairwise genetic relatedness, and calculate the proportion of pairwise comparisons that have a relatedness \(\ge 0.99\) within and between operational units. For that we are going to use the following functions:

  1. ampseq2loci: converts the ampseq_object into a loci_object that only contains the allele information (read abundance is removed). This function only requires the ampseq_object as an input argument.
# generating the loci_object----
loci_object = ampseq2loci(ampseq_filtered)
  1. pairwise_hmmIBD: computes pairwise genetic relatedness using the HMMIBD algorithm implemented in panlejugde R package (https://github.com/aimeertaylor/paneljudge). As an input it requires a loci_object and automatically all pairwise comparisons will be computed. Its output is a long format data.frame of all pairwise comparisons. This function is computationally exhaustive, so it has parallel argument to speed up the calculation. In the current example, the matrix is been generated separately in the cluster at Broadinstitute, and the .csv file is loaded by using the function read.csv.
# measuring relatedness----

if(file.exists('pairwise_relatedness.csv')){
  pairwise_relatedness = read.csv('pairwise_relatedness.csv')
}else{
  pairwise_relatedness = NULL
  
  for(w in 1:500){
    start = Sys.time()
    pairwise_relatedness = rbind(pairwise_relatedness,
                                 pairwise_hmmIBD(loci_object, parallel = T, w = w, n = 500))
    time_diff = Sys.time() - start
    
    print(paste0('step ', w, ' done in ', time_diff, ' secs'))
    
  }
  
  write.csv(pairwise_relatedness, 'pairwise_relatedness.csv', row.names = F, quote = F)
}
  1. plot_relatedness_distribution: visualizes the distribution of the pairwise relatedness within and between the different study areas. Its arguments are relatedness_matrix, metadata, Population, and fill_color.
plot_relatedness_distribution_within = plot_relatedness_distribution(
  pairwise_relatedness = pairwise_relatedness,
  metadata = ampseq@metadata,
  Population = 'Subnational_level2',
  fill_color = pop_colors,
  type_pop_comparison = 'within',
  ncol = 5,
  pop_levels = pop_levels
)

plot_relatedness_distribution_within$plot
**Figure 10:** Histogram of the distribution of pairwise genetic relatedness by study area (Panels). The dashed vertical line represents the average genetic relatedness in the overall population.

Figure 10: Histogram of the distribution of pairwise genetic relatedness by study area (Panels). The dashed vertical line represents the average genetic relatedness in the overall population.

  1. plot_frac_highly_related: visualizes the fraction of highly related haplotypes within and between study areas. It has the same arguments than the previous function plus threshold (minimum value to consider two samples to be highly related).
plot_frac_highly_related_within = plot_frac_highly_related(
  pairwise_relatedness = pairwise_relatedness,
  metadata = ampseq@metadata,
  Population = 'Subnational_level2',
  fill_color = pop_colors,
  threshold = 0.99, type_pop_comparison = 'within', pop_levels = pop_levels)

plot_frac_highly_related_within$plot
**Figure 11:** Proportion of highly related samples (IBD > 0.99) in each study area.

Figure 11: Proportion of highly related samples (IBD > 0.99) in each study area.

  1. plot_frac_highly_related_over_time: visualizes the fraction of highly related haplotypes within and between study areas, and by a temporal variable. It has the same arguments than the previous function plus threshold (minimum value to consider two samples to be highly related).
plot_frac_highly_related_over_quarters_within = plot_frac_highly_related_over_time(pairwise_relatedness = pairwise_relatedness,
                                       metadata = ampseq@metadata,
                                       Population = c('Subnational_level2', 'Quarter_of_Collection'),
                                       fill_color = pop_colors,
                                       threshold = 0.99,
                                       type_pop_comparison = 'within',
                                       pop_levels = pop_levels, ncol = 5)

plot_frac_highly_related_over_quarters_within$plot_frac_highly_related
**Figure 12:** Proportion of highly related samples over year quarters in each study area.

Figure 12: Proportion of highly related samples over year quarters in each study area.

nsamples_vs_highlyR = left_join(plot_frac_highly_related_over_quarters_within$plot_frac_highly_related$data %>% mutate(pop = paste(Pop_comparison, Date_Yi, sep = '_'))%>%
                                   ungroup()%>%
                                   select(pop, prop),
          metadata %>%
  filter(!is.na(Subnational_level2)) %>%
  group_by(Subnational_level2, Quarter_of_Collection) %>%
  summarise(nsamples= n()) %>% mutate(pop = paste(Subnational_level2, Quarter_of_Collection, sep = '_')), by = 'pop')  %>%
  filter(prop != 0, Subnational_level2 %in% c('Municipality 1', 'Municipality 2', 'Municipality 3'))

plot_cor_highlyR_nsamples = ggscatter(nsamples_vs_highlyR, x = "prop", y = "nsamples", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "spearman",
          title= 'D) # Collected samples vs. Proportion of highly related samples',
          xlab = "Proportion of highly related samples", ylab = "Number of samples",
          facet.by = 'Subnational_level2')
  1. plot_network: creates a network representation of the pairwise relatedness information. The function requires 7 arguments: 1) the pairwise_relatedness a long format data.frame with all pairwise comparisons, 2) threshold to define the minimum prairwise relatedness to plot a branch between the two compared samples, 3) a metadata table containing the variables the we want to include in the plot, 4) sample_id name of the column in metadata table where sample ids are located, 5) group_by variable from metadata table that we are going to use to assign colors to the nodes in the network, 6) levels order of the elements of group_by variable, and finally 7) colors that are going to be assigned to each element from group_by, in this case one color per population.
Relatedness_network = plot_network(pairwise_relatedness = pairwise_relatedness,
                               threshold = .99,
                               metadata = ampseq_filtered@metadata,
                               sample_id = 'Sample_id',
                               group_by = 'Subnational_level2',
                               levels = pop_levels,
                               colors = pop_colors
                              )
**Figure 14:** Network representation of pairwise genetic relatedness between samples. Each node represents a sample, and branches are displayed only for pairwise comparisons with a relatedness greater than 0.99. Colors were assigned based on the sampling location: Municipality 1 (Red), Municipality 2 (Blue), Municipality 3 (Gold), Municipality 4 (Green), and Municipality 5 (Coral).

Figure 14: Network representation of pairwise genetic relatedness between samples. Each node represents a sample, and branches are displayed only for pairwise comparisons with a relatedness greater than 0.99. Colors were assigned based on the sampling location: Municipality 1 (Red), Municipality 2 (Blue), Municipality 3 (Gold), Municipality 4 (Green), and Municipality 5 (Coral).

7 Measuring geographic connectivity

Malaria transmission is heterogeneous throughout the Colombian territory, and it is mainly concentrated in transmission foci where demographic and environmental factors favor the interaction between the two hosts of the parasite, humans and Anopheles mosquitoes. Operationally, disease control units are defined based on sub-national administrative divisions. The connectivity or dispersal pattern of the parasite between these geographic units may limit progress in disease control if activities are not properly coordinated. This connectivity is mainly influenced by human movement, and can be inferred by estimating the proportion of genetically closely related parasites between the different areas.

We can use again the function plot_relatedness_distribution to create histograms of the diistribution of relatenes between samples from different study sites. In this case, we are going to set type_pop_comparison = 'between'.

plot_relatedness_distribution_between = plot_relatedness_distribution(
  pairwise_relatedness = pairwise_relatedness,
  metadata = ampseq_filtered@metadata %>% mutate(Pop = gsub('unicipality ','', Subnational_level2)),
  Population = 'Pop',
  fill_color = rep('gray50', 10),
  type_pop_comparison = 'between',
  ncol = 5,
  pop_levels = c("M1-M2", "M1-M3", "M1-M4",
                  "M1-M5", "M2-M3", "M2-M4", 
                  "M2-M5", "M3-M4", "M3-M5", 
                  "M4-M5")
)

plot_relatedness_distribution_between$plot
**Figure 15:** Histogram distribution of pairwise genetic relatedness between sampling locations. Panels in columns correspond to the three study areas. The dotted vertical lines represent the average genetic relatedness in the whole population.

Figure 15: Histogram distribution of pairwise genetic relatedness between sampling locations. Panels in columns correspond to the three study areas. The dotted vertical lines represent the average genetic relatedness in the whole population.

Again the fraction of highly related haplotypes between sites can be explored with the function plot_frac_highly_related and setting type_pop_comparison = 'between'.

plot_frac_highly_related_between = plot_frac_highly_related(
  pairwise_relatedness = pairwise_relatedness,
  metadata = ampseq_filtered@metadata %>% mutate(Pop = gsub('unicipality ','', Subnational_level2)),
  Population = 'Pop',
  fill_color = rep('gray50', 10),
  threshold = 0.99, type_pop_comparison = 'between',
  pop_levels = c("M1-M2", "M1-M3", "M1-M4",
                  "M1-M5", "M2-M3", "M2-M4", 
                  "M2-M5", "M3-M4", "M3-M5", 
                  "M4-M5"))

plot_frac_highly_related_between$plot
**Figure 8:** Proportion of highly related samples between sampling locations.

Figure 8: Proportion of highly related samples between sampling locations.